ABSTRACT

A common consensus in the literature is that the citation profile of
published articles in general follows a universal pattern: an initial
growth in the number of citations within the first two to three years
after publication, followed by a steady peak of one to two years, and
then a final decline over the rest of the lifetime of the article. This
observation has long been the underlying heuristic in determining major
bibliometric factors such as the quality of a publication, the growth
of scientific communities, the impact factor of publication venues,
etc. In this paper, we gather and analyze a massive dataset of 1.5
million scientific papers from the computer science domain and notice
that the citation count of the articles over the years follows a
remarkably diverse set of patterns: a profile with an initial peak
(PeakInit), with distinct multiple peaks (PeakMul), with a peak late in
time (PeakLate), that is monotonically decreasing (MonDec), that is
monotonically increasing (MonIncr), and that cannot be categorized into
any of the above (Oth). We conduct a thorough experiment to investigate
several important characteristics of these categories, such as how
individual categories attract citations, how the categorization is
influenced by the year and the venue of publication of papers, how each
category is affected by self-citations, the stability of the categories
over time, and how much each of these categories contributes to the
core of the network. Further, we show that the traditional preferential
attachment models fail to explain these citation profiles. Therefore,
we propose a novel dynamic growth model that takes both preferential
attachment and the aging factor into account in order to replicate the
real-world behavior of various citation profiles. We believe that this
paper opens the scope for a serious re-investigation of the existing
bibliometric indices for scientific research.

1. INTRODUCTION 

Quantitative analysis in terms of counting, measuring, comparing
quantities and analyzing measurements is perhaps the main tool to
understand the impact of science. With the progress of time, scientific
research itself, by recording and communicating research results
through scientific publications, has become enormous and complex. The
complexity has become so specialized that individual understanding and
experience are no longer sufficient to unfold trends or to make crucial
decisions. Therefore, an exhaustive analysis of research outputs in
terms of scientific publications is of great interest among scientific
communities in order to be selective, to highlight significant or
promising areas of research, and to better manage investigation in
science [23, 26]. Bibliometrics (aka Scientometrics) [4, 30], the
application of quantitative analysis and statistics to publications
such as research articles and their accompanying citation counts, turns
out to be the main tool for such an investigation. Since the pioneering
research of Garfield [14], the use of citation analysis in
bibliographic research has served as the fundamental quantifier for
evaluating the contribution of researchers and research outcomes.

A citation network represents the knowledge graph of science, where
individual papers are knowledge sources and their interconnectedness in
terms of citation represents the relatedness among various kinds of
knowledge. For instance, a citation network is considered to be an
effective proxy for studying disciplinary knowledge flow, is used to
discover the knowledge backbone of a particular research area, and
helps in grouping similar kinds of knowledge and ideas. Numerous
studies have been conducted on citation networks and their evolution
over time. There is already a well-accepted belief about the dynamics
of citations that a scientific article receives after publication: an
initial growth (growing phase) in the number of citations within the
first two to three years after publication, followed by a steady peak
of one to two years (saturation phase), and then a final decline over
the rest of the lifetime of the article (decline and obsolete phases),
as shown in Figure 1 [15, 17]. In most cases, this observation has been
drawn from the analysis of a very limited set of publication data, thus
obfuscating the true characteristics. Here, we conduct our experiment
on a massive bibliographic dataset of the computer science domain
comprising more than 1.5 million papers published between 1970 and
2010. Strikingly, unlike earlier observations about the citation
profile of a paper, we notice six different types of citation profiles
prevalent in the dataset (namely, PeakInit, PeakMul, PeakLate, MonDec,
MonIncr and Oth). We exhaustively analyze these profiles to explore the
micro-dynamics controlling the actual growth of the underlying citation
network, which have remained unexplored in the existing literature.
This categorization allows us to propose a holistic view of the growth
of the citation network through a dynamic model that takes into account
the well-accepted concept of preferential attachment [25] along with
the aging factor of scientific articles in order to reproduce the
different citation profiles observed in the real-world dataset. To the
best of our knowledge, this is the first attempt to consider these two
factors together in synthesizing the dynamic growth process of citation
profiles. We believe that the key observations made in this paper will
not only help in reformulating the existing bibliographic indices such
as the Journal Impact Factor (JIF), but will also enhance general
bibliometric research such as citation link prediction, information
retrieval and self-citation characterization.


Key insights:

• Analyzing a massive dataset of the computer science domain reveals
six distinctive citation trajectories of scientific articles.

• After suitable characterization of these profiles, major
modifications of the existing bibliographic indices seem to be a
compelling task.

• Unlike existing network-growth models, these trajectories can only be
reproduced once both “preferential attachment” and “aging” are taken
into account together.


According to Eugene Garfield, a citation is nothing but a means to
(i) pay homage to pioneers, (ii) give credit for related work (homage
to peers), (iii) identify methodology, equipment etc., (iv) provide
background reading, (v) correct one’s own work or the work of others,
and so on.


Figure 1: (Color online) A hypothetical example showing the traditional
belief in the pattern of the citation profile of a scientific paper
after publication.


• Earlier, the citation trajectory of an article was assumed to
increase initially and then decline.

• We observe six distinct citation trajectories after analyzing a
massive dataset of the computer science domain.

• Since the citation profile can be categorized into at least six
different types, all measures of scientific impact (e.g., impact
factor) require a serious revisit.



Figure 2: A systematic flowchart demonstrating the 
rules for classifying the training samples. 


2. A MASSIVE PUBLICATION DATASET 

Most experiments in the literature on analyzing citation profiles have
worked with small datasets. In this experiment, however, we gather and
analyze a massive dataset to validate our hypothesis. We crawled one of
the largest publicly available datasets, Microsoft Academic Search
(MAS)¹, which houses over 4.1 million publications and 2.7 million
authors, with updates added every week [9]. We collected all the papers
published in the computer science domain and indexed by MAS.² The
crawled dataset contains more than 2 million distinct papers
altogether, which are further distributed over 24 fields of the
computer science domain (as categorized by MAS). Moreover, each paper
comes with various bibliographic information: the title of the paper, a
unique index for the paper, its author(s), the affiliation of the
author(s), the year of publication, the publication venue, the related
field(s) of the paper, the abstract and the keyword(s).

In order to remove the anomalies that crept in due to crawling, the
dataset was passed through a series of initial preprocessing stages.
The filtered dataset contains more than 1.5 million papers, with 8.68%
of the papers belonging to multiple fields (these act as
interdisciplinary papers). We have made the dataset publicly available
at http://cnerg.org (see “Resources” tab).

¹ academic.research.microsoft.com

² The crawling process took around six weeks and was completed in
August 2013.

3. CATEGORIZATION OF CITATION PROFILES

Figure 3: (Color online) Citation profiles for the first five
categories obtained from analyzing the real-world citation dataset (top
panel) and a comparison with the results obtained from the model
(bottom panel). Each frame corresponds to one category. The ‘Oth’
category does not follow any consistent pattern and is therefore not
shown here. In each frame, a citation belt is formed by the lines Q1
(green line) and Q3 (blue line), which represent the first (10% of
points lie below this line) and third (10% of points lie above this
line) quartiles of the data points respectively (i.e., effectively 80%
of points are within the citation belt), and the red line drawn within
the citation belt represents the average behavior of all the profiles
corresponding to that category. Top panel: for each category, one
representative citation profile (taken from real data) is shown in the
middle of the belt (broken black line). Bottom panel: the color coding
is similar to that of the top panel; however, the broken lines are the
results obtained from our model.

Since the primary focus of our study is to analyze the citation growth
of a paper after publication, an in-depth understanding of how the
number of citations of a paper varies over the years after publication
is necessary. We therefore conduct an exhaustive analysis of the
citation patterns of the different papers present in our dataset. Some
previous experimental results show that the trend of citations received
by a paper after its publication date is not linear in general; rather,
there is a fast growth of citations within the initial few years,
followed by an exponential decay. This conclusion has been drawn mostly
from the analysis of small publication archives. In this work, for an
extensive analysis, we first take all the papers having at least 10
years of citation history, and consider at most 20 years of their
citation history. This is followed by a series of data processing
steps. First, to smooth the time series data points in the citation
profile of a paper, we use five-year moving average filtering; then, we
scale the data points by normalizing them with the maximum value
present in the time series (i.e., the maximum number of citations
received by the paper in any single year); finally, we run a local peak
detection algorithm³ to detect peaks in the citation profile. In
addition, we apply the following two heuristics to specify peaks:
(i) the height of a peak should be at least 75% of the maximum peak
height, and (ii) two consecutive peaks should be separated by more than
2 years; otherwise they are treated as a single peak. A systematic
flowchart to detect each category is shown in Figure 2.
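The preprocessing steps described above (smoothing, normalization and
peak detection with the two heuristics) can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not
the Matlab routine used in the study: the function names are
hypothetical, the moving average is centered with a window that shrinks
at the edges, end points of the series are never reported as peaks, and
peaks lying 2 years apart or closer are merged by keeping the taller
one.

```python
def moving_average(series, window=5):
    """Centered moving average; the window shrinks near the edges (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def normalize(series):
    """Scale the series by its maximum so that the tallest point equals 1."""
    top = max(series)
    return [x / top for x in series] if top > 0 else list(series)


def find_peaks(series, min_height=0.75, min_distance=2):
    """Return indices of interior local maxima that are at least min_height.

    Peaks separated by min_distance years or fewer are merged into one,
    keeping the taller of the two (assumed merge rule).
    """
    candidates = [
        i for i in range(1, len(series) - 1)
        if series[i] > series[i - 1]
        and series[i] >= series[i + 1]
        and series[i] >= min_height
    ]
    peaks = []
    for i in candidates:
        if peaks and i - peaks[-1] <= min_distance:
            if series[i] > series[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return peaks


def detect_peaks(citations):
    """Full pipeline: smooth, normalize, then detect peaks."""
    return find_peaks(normalize(moving_average(citations)))
```

Because the series is normalized before peak detection, the threshold
of 0.75 corresponds directly to "75% of the maximum peak height".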

³ The peak detection algorithm is available in the Matlab Spectral
Analysis package (http://www.mathworks.in/help/signal/ref/findpeaks.html);
we use ‘MINPEAKDISTANCE’ = 2 and ‘MINPEAKHEIGHT’ = 0.75, and the
default values for the other parameters.

Figure 4: (Color online) Percentage of papers in six categories for
different research fields of the computer science domain. For most of
the fields, the pattern is similar, except for World Wide Web.

Remarkably, we notice that a major proportion of papers do not follow
the traditional citation profile mentioned in the earlier studies (see
Figure 1); rather, there exist six different types of citation profiles
of research papers based on the count and the position of the peaks
present in a profile. The definitions of the six types of citation
profiles, with their individual proportions in the entire dataset, are
given below:

(i) PeakInit: Papers whose citation count peaks within the first 5
years of publication (but not in the first year), followed by an
exponential decay (proportion: 25.2%) (Figure 3(a)).

(ii) PeakMul: Papers having multiple peaks at different time points of
the citation profile (proportion: 23.5%) (Figure 3(b)).

(iii) PeakLate: Papers having very few citations at the beginning and
then a single peak at least 5 years after publication, which is
followed by an exponential decay in the citation count (proportion:
3.7%) (Figure 3(c)).

(iv) MonDec: Papers whose citation count peaks in the year immediately
after publication, followed by a monotonic decrease in the number of
citations (proportion: 1.6%) (Figure 3(d)).

(v) MonIncr: Papers having a monotonic increase in the number of
citations from the year of publication until the date of observation
(which can be 20 years after publication) (proportion: 1.2%)
(Figure 3(e)).

(vi) Oth: Apart from the above types, there exist a large number of
papers which on average receive less than one citation per year. For
these papers, the evidence is not significant enough to assign them to
one of the above categories, and therefore they remain a separate group
altogether (proportion: 44.8%).
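The six definitions above can be turned into a small rule-based
classifier. The sketch below is only an approximation written from the
textual definitions; it is not the flowchart of Figure 2, which is not
reproduced in the text. The function name `classify_profile`, the index
convention (index 0 is the publication year) and the order in which the
rules are tested are all assumptions made for illustration.

```python
def classify_profile(citations, peak_years):
    """Assign a citation profile to one of the six categories.

    citations  : yearly citation counts, index 0 = year of publication.
    peak_years : indices of the peaks detected in the smoothed profile.
    The rule order below is an assumption, not the authors' flowchart.
    """
    # Oth: on average fewer than one citation per year.
    if sum(citations) / len(citations) < 1.0:
        return "Oth"
    # MonIncr: citations never decrease from one year to the next.
    if all(a <= b for a, b in zip(citations, citations[1:])):
        return "MonIncr"
    # MonDec: maximum in the year right after publication, then no increase.
    tail = citations[1:]
    if citations[1] == max(citations) and all(
        a >= b for a, b in zip(tail, tail[1:])
    ):
        return "MonDec"
    # PeakMul: two or more distinct peaks.
    if len(peak_years) >= 2:
        return "PeakMul"
    # Single peak: early (within 5 years) or late.
    if len(peak_years) == 1:
        return "PeakInit" if peak_years[0] <= 5 else "PeakLate"
    return "Oth"
```

For example, a profile such as `[1, 3, 6, 9, 6, 4, 3, 2, 2, 1]` with a
single detected peak at index 3 would be labeled PeakInit under these
assumed rules.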

The rich metadata in the dataset further allows us to conduct a
second-level analysis of these categories for the different research
fields of the computer science domain. We measure the percentage of
papers in different categories for each of the 24 research fields after
filtering out all the papers in the Oth category. Surprisingly, we
notice that while for all other fields the largest fraction of papers
belongs to the PeakMul category, for the field World Wide Web (WWW) the
largest fraction is in the PeakInit category (see Figure 4). A possible
reason is that, since WWW is mostly a conference-based field of
research, papers in the PeakInit category dominate this field (see
Section 5.1).



Figure 5: (Color online) Contribution of papers 
from each category in different citation buckets. The 
entire range of citation value present in the dataset 
is divided into seven buckets. In each bucket, the 
contribution of papers from a particular category is 
normalized by the total number of papers in that 
category. 

4. CONTRIBUTION OF CATEGORIES IN 
DIFFERENT CITATION RANGES 

One of the fundamental aspects of analyzing scientific publications is
to measure how acceptable they are to the research community. This is
often measured by the raw citation count: the more citations an article
receives from other publications, the more it is assumed to be admired
by researchers, and hence the greater its scientific impact. In the
current context, an interesting question is which of the six categories
contains the papers that are admired most in terms of citations. In
order to answer this question, we conduct a systematic study: the total
citation range is divided into four buckets (the citation ranges are
11-12, 13-15, 16-19 and 20-11408) such that each citation bucket
contains an almost equal number of papers. For a deeper analysis of the
highest citation range, we further divide the last bucket (20-11408)
into four more ranges, thus obtaining seven buckets altogether. Then we
measure the proportion of papers contributed by a particular category
to a citation bucket (see Figure 5). Note that in each citation bucket,
the number of papers contributed by a category is normalized by the
total number of papers belonging to that category. Therefore, this
figure is a histogram of a conditional probability distribution: the
probability that a randomly selected paper falls in citation bucket i
given that it belongs to category j. The normalization is required in
order to avoid population bias across different categories. We observe
that the higher citation region is mostly occupied by papers in the
PeakLate and MonIncr categories, followed by PeakMul and PeakInit. We
also notice that the MonDec category, which has the smallest proportion
in the last citation bucket, shows a monotonic fall in the fraction of
papers as the citation range increases. This initial evidence suggests
a general and non-intuitive interpretation of citation profiles: if a
paper does not obtain high citations within the first few years after
its publication, it does not necessarily mean that it will remain
low-impact throughout its lifetime; rather, its citation growth rate
might accelerate in the future, and it could indeed turn out to be a
well-accepted paper in the scientific community. We further explain
this behavior in the next section.
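The per-category normalization described above, i.e. the probability
that a paper falls in bucket i given that it belongs to category j, can
be sketched as follows. The function name `bucket_share_by_category`
and the input layout (a list of category and citation-count pairs) are
hypothetical, and papers whose citation count falls outside every
bucket are simply ignored, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def bucket_share_by_category(papers, bucket_edges):
    """Estimate P(bucket i | category j) from raw counts.

    papers       : iterable of (category, total_citations) pairs.
    bucket_edges : list of inclusive (low, high) citation ranges.
    Returns {category: [share of bucket 0, share of bucket 1, ...]}.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for category, citations in papers:
        for i, (low, high) in enumerate(bucket_edges):
            if low <= citations <= high:
                counts[category][i] += 1
                totals[category] += 1
                break
    # Normalize each category's counts by that category's own total.
    return {
        category: [
            counts[category][i] / totals[category]
            for i in range(len(bucket_edges))
        ]
        for category in totals
    }
```

Dividing by the category total rather than the bucket total is what
keeps a large category from dominating every bucket merely because it
contains more papers.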

Table 1: Mean publication year Y (its standard deviation σ(Y)) and the
percentage of papers in conferences and journals for each category of
citation profile.

Category   Mean publication   % of conference   % of journal
           year (σ(Y))        papers            papers
PeakInit   1994 (5.19)        64.35             35.65
PeakMul    1991 (6.68)        39.03             60.97
PeakLate   1992 (6.54)        39.89             60.11
MonDec     1994 (5.44)        60.73             39.27
MonIncr    1993 (7.36)        25.26             74.74


5. CHARACTERIZING DIFFERENT CITATION PROFILES

The rich metadata information of the publication dataset 
further allows us to understand the characteristic features 
of each of these six categories at finer levels of detail. 

5.1 Influences of publication year and publication venues on the
categorization


• WWW has the majority of its papers in the PeakInit category.

• Among all fields, Simulation and Computer Education have the highest
proportion of papers in the MonDec category, while Bioinformatics and
Machine Learning have the lowest.

• Security and Privacy as well as Bioinformatics have the highest
proportion of papers in the PeakLate category, while Simulation and WWW
have the lowest.


• Papers in the PeakLate and MonIncr categories seem to receive the
most citations.

• Papers in the MonDec category mostly fall in the low citation region.

• The first few years’ citation counts might not be a good indicator of
the ultimate extent of acceptance of a paper in the scientific
community.

One might immediately ask whether this categorization is influenced by
the time (year) when the papers were published, i.e., papers published
earlier might follow the well-known behavior whereas papers published
recently might show a different behavior. In order to verify that the
categorization is not biased by the publication time period, we measure
the average year of publication of the papers in each category. From
the second column of Table 1, we can conclude that the citation pattern
of the papers is not biased by the publication year, since the average
years roughly correspond to the same time period. On the other hand,
the mode of publication in conferences is significantly different from
that of journals, and therefore the citation profiles of papers
published in these two types of venue are also expected to be
different. To analyze the venue effect on the categorization, we
measure the percentage of papers published in conferences vis-a-vis in
journals for each category, as shown in the third and fourth columns of
Table 1 respectively. We observe that while most of the papers in the
PeakInit (64.35%) and MonDec (60.73%) categories are published in
conferences, papers belonging to the PeakLate (60.11%) and MonIncr
(74.74%) categories are mostly published in journals. Hence, if a
publication starts receiving greater attention or citations at a later
part of its lifetime, it is more likely to have been published in a
journal, and vice versa.

Another interesting point to be noted from these results is that
although the existing formulation of the Journal Impact Factor [14] has
been defined taking into consideration the citation profile shown in
Figure 1, most of the journal papers which fall in PeakLate or MonIncr
do not follow such a profile at all; at least for papers in the
PeakLate category, the metric does not focus on the most relevant time
frame of the citation profile (mostly after the first 5 years of
publication). In the light of the current results, the appropriateness
of the formulation of bibliographic metrics such as the Journal Impact
Factor remains doubtful.

The Impact Factor of a journal at any given time is the average number
of citations received per paper published in that journal during the
two preceding years.


• Due to the increasing popularity of conferences in an applied domain
like computer science, conference papers get quick publicity within a
few years after publication, which is also the reason for the rapid
decay of their popularity.

• In contrast, journal papers usually take time to get published and
also to gain popularity, and are thus mostly admired much later after
publication. However, most journal papers remain consistent in
receiving citations even many years after their publication.


5.2 Effect of self-citation on the categorization

Another factor that often affects citation rate is self-citation [12].
Self-citation can inflate the perception of an article’s or a
scientist’s scientific impact, particularly when an article has many
authors, increasing the possible number of self-citations, and thus
there have been calls to remove self-citations from citation rates
[28]. We also conduct a similar experiment to observe the effect of
self-citation on the categorization of citation profiles.

Table 2: (Color online) Confusion matrix representing the transition of
categories due to the removal of self-citations. A value x in cell
(i, j) indicates that a fraction x of the papers in category i would
have fallen in category j if self-citations were absent from the entire
dataset. Note that no row is given for the Oth category because papers
from this category can never move to other categories through the
deletion of citations.

Category   PeakInit  PeakMul  PeakLate  MonDec  MonIncr  Oth
PeakInit   0.72      0.10     0.03      0.01    0        0.15
PeakMul    0.02      0.81     0.04      0       0.1      0.11
PeakLate   0.01      0.06     0.86      0       0.01     0.06
MonDec     0.05      0.14     0         0.41    0        0.35
MonIncr    0         0.02     0.01      0.01    0.88     0.09

Figure 6: (Color online) Fraction of self-citations per paper in
different categories over different time periods after publication;
(inset) fraction of papers in each category migrating to the Oth
category due to the removal of self-citations, assuming different
category thresholds.

Essentially, we first remove a citation from the dataset if the citing
and the cited papers have at least one author in common, and then
measure what fraction of papers in each category migrate to some other
category due to this removal. Table 2 presents a confusion matrix where
the labels in the rows and the columns represent the categories before
and after removing self-citations respectively. We observe that papers
in MonDec are strongly affected by the self-citation phenomenon. Around
35% of the papers in MonDec would have been in the ‘Oth’ category had
it not been for self-citations. However, one might argue that this
might be an artifact of the thresholding that we impose (see Section 3)
to categorize papers (papers receiving less than or equal to 10
citations in the first 10 years after publication are considered to be
in the ‘Oth’ category). In order to verify the effect of thresholding
on the inter-category migration after removing self-citations, we vary
the category threshold from 10 to 14 and plot in Figure 6 (inset) the
fraction of papers in each category migrating to the Oth category due
to the removal of self-citations. The result agrees with the
observation noted in Table 2: the MonDec category is affected most by
self-citations, followed by PeakInit, PeakMul and PeakLate. This
indicates that the effect of self-citations is due to the inherent
characteristics of each category, rather than due to the predefined
threshold setting of the category boundary.


• Authors tend to cite their own papers within 2-3 years of publication
in order to increase the visibility of their work.

• The MonDec and PeakInit categories (i.e., mostly conference papers)
are highly affected by self-citation.

• Self-citation is usually used in the initial period after publication
by authors in an attempt to increase the visibility of their
publications in the scientific community.


Figure 7: (Color online) Alluvial diagram representing the evolution of
papers in different categories and the flows between the categories at
time T + 10, T + 15 and T + 20. The colored blocks correspond to
different categories. The size of a block indicates the number of
papers in that category, and the shaded waves joining the regions
represent the flow of papers between the regions, such that the width
of a flow corresponds to the fraction of papers. The total width of the
incoming flows is equal to the width of the corresponding region.

In Figure 6, we show how self-citations are distributed across
different time periods for the individual categories. For each
category, we first aggregate all the self-citations and plot the
fraction of self-citations over the time after publication. As
expected, for the MonDec category we observe that most self-citations
are “farmed” within 2-3 years after publication. Similar observations
hold for both the PeakInit and Oth categories. Note that, as observed
earlier, papers in the PeakInit and MonDec categories are mostly
conference papers. Therefore, we can conclude that conference papers
tend to be affected most by self-citations. However, the
characteristics of the highly cited categories such as MonIncr and
PeakLate are mostly consistent throughout the years, which shows that
these categories are less dependent on self-citations.
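The self-citation experiment of this section (drop every citation whose
citing and cited papers share an author, re-categorize, and tabulate
the migrations) can be sketched as below. This is an illustrative
outline only: the names `remove_self_citations` and `migration_matrix`
and the dictionary-based inputs are assumptions, and the
re-categorization step between the two functions is left to whatever
classifier is used.

```python
from collections import Counter, defaultdict


def remove_self_citations(citations, authors):
    """Keep only citations whose two papers share no author.

    citations : iterable of (citing_id, cited_id) pairs.
    authors   : {paper_id: set of author identifiers}.
    """
    return [
        (citing, cited)
        for citing, cited in citations
        if not (authors.get(citing, set()) & authors.get(cited, set()))
    ]


def migration_matrix(before, after):
    """Row-normalized confusion matrix of category changes.

    before, after : {paper_id: category} prior to and following the
    removal of self-citations. Returns {old: {new: fraction}}.
    """
    counts = defaultdict(Counter)
    for paper, old in before.items():
        counts[old][after[paper]] += 1
    return {
        old: {new: n / sum(row.values()) for new, n in row.items()}
        for old, row in counts.items()
    }
```

Each row of the returned matrix sums to one, which matches the
row-wise reading of a confusion matrix in which cell (i, j) holds the
fraction of category i papers that end up in category j.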

6. ANALYZING STABILITY OF DIFFERENT 
CATEGORIES 


The number of citations for a paper changes over time depending on its
long- or short-lasting effect on the scientific community, which in
turn might change the shape of the citation profile. Therefore,
studying the temporal evolution of each citation profile can help us
understand the stability of the individual categories. Since we know
the category of those papers that have at least 20 years of citation
history, for each such paper we further analyze how the shape of the
profile evolves over this 20-year timeline. Essentially, after
publication of a paper at time T, we identify its category at time
T + 10, T + 15 and T + 20 based on the heuristics discussed earlier. We
hypothesize that a stable citation category tends to maintain its shape
throughout the entire timeline. The colored blocks of the alluvial
diagram [27] in Figure 7 correspond to the different categories at the
three timestamps. We observe that apart from the Oth category, which
has a major proportion of papers, MonDec seems to be the most stable,
followed by PeakInit. However, papers which are assumed to fall in the
Oth category quite often turn out to be MonIncr papers in later time
periods. This analysis demonstrates a systematic approach to unfold the
transitions from one category to another that take place in scientific
research as citations increase.

• Papers in the Oth category often shift to the MonIncr category by
acquiring more citations in later time periods.

• If a paper falls in either the MonDec or the PeakInit category early
on, it is less likely to shift to some other category.


7. CORE-PERIPHERY ANALYSIS

Figure 8: (Color online) Multi-level pie charts for the years 2000,
2004, 2007 and 2010 showing the composition of each of the categories
in different k_s-shell regions, where the colors represent different
categories and the area covered by each colored region in each
k_s-shell denotes the proportion of papers of the corresponding
category in that shell. The innermost shell is the core region and the
outermost shell is the periphery region. For better visualization, we
divide the total number of shells identified from the citation network
in each year into six broad shells; thus the core-periphery structure
in each year has six concentric layers.

Although Figure 5 indicates the impact of the different categories in
terms of raw citation count, it neither reveals the significance of the
papers in each category in forming the core of the network nor gives us
any information regarding the temporal evolution of the structure. For
a better and more detailed understanding, we perform k-core analysis of
the evolving citation network by decomposing the network for each year
into its k_s-shells, such that an inner shell index of a paper reflects
a central position in the core of the network.

We construct aggregated citation networks for different years (2000,
2004, 2007 and 2010) such that the citation network constructed for the
year Y contains the induced subgraph of all papers published in or
before Y. Then, for each such network, we run the following procedure:
we start by recursively removing nodes that have a single inward link
until no such nodes remain in the network. These nodes form the 1-shell
of the network (k_s-shell index k_s = 1). Similarly, by recursively
removing all nodes with degree 2, we get the 2-shell. We continue
increasing k until all nodes in the network have been assigned to one
of the shells. The union of all the shells with index greater than or
equal to k_s is called the k_s-core of the network, and the union of
all shells with index smaller than or equal to k_s is the k_s-crust of
the network. The idea is to show how the papers in each category
(identified in the year 2000) migrate from one shell to another after
receiving citations over the next 10 years. It also allows us to
observe how persistent a category is in a particular shell.
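The shell decomposition described above can be sketched with a simple
peeling loop. This is a minimal sketch under an explicit assumption: it
uses the standard k-shell procedure on the undirected degree of each
node (ignoring citation direction), which may differ from the exact
degree definition used in the study. The function names
`k_shell_indices` and `k_core` are hypothetical.

```python
def k_shell_indices(edges):
    """Return {node: k_s} using standard k-shell peeling.

    edges : iterable of (u, v) pairs, treated as undirected (assumption).
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    degree = {n: len(adj) for n, adj in neighbors.items()}
    remaining = set(neighbors)
    shell = {}
    k = 1
    while remaining:
        # Repeatedly peel every node whose remaining degree is at most k.
        peel = [n for n in remaining if degree[n] <= k]
        while peel:
            n = peel.pop()
            if n not in remaining:
                continue
            remaining.discard(n)
            shell[n] = k
            for m in neighbors[n]:
                if m in remaining:
                    degree[m] -= 1
                    if degree[m] <= k:
                        peel.append(m)
        k += 1
    return shell


def k_core(shell, k):
    """Union of all shells with index greater than or equal to k."""
    return {n for n, s in shell.items() if s >= k}
```

Running this once per yearly snapshot and comparing a paper's shell
index across snapshots gives the kind of shell-to-shell migration the
paragraph above refers to.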

In Figure 5, we notice that the majority of papers in the
Oth category lie in the periphery, and their proportion in the
periphery increases over time, which indicates that the papers
in this category are becoming increasingly less popular
over time. The PeakMul category gradually leaves the peripheral
region over time and mostly occupies the two innermost
shells. PeakInit and MonDec show almost similar behavior,
with a major proportion of papers in inner cores in the initial
year but gradually shifting towards peripheral regions.
On the other hand, MonIncr and PeakLate show the expected
behavior, with their proportion increasing in the inner shells
over time, indicating their rising relevance as time progresses.
This study helps us identify the temporal evolution of the importance
of different categories in terms of how each of them
contributes to the central position of the citation network.
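
To make the migration analysis concrete, the following sketch shows one way to build a yearly snapshot and to measure what share of each category sits in each shell. It is an assumption-laden illustration: the inputs (`citations` as (citing, cited) pairs, `pub_year`, `category`, and a precomputed `shell` mapping) and the function names are hypothetical, and shell indices are taken as given from any k-shell routine.

```python
from collections import Counter, defaultdict


def snapshot_edges(citations, pub_year, year):
    """Keep citations whose two endpoints were both published at or before `year`."""
    return [(src, dst) for src, dst in citations
            if pub_year[src] <= year and pub_year[dst] <= year]


def shell_share_by_category(shell, category):
    """For each category, return the fraction of its papers found in each shell."""
    counts = defaultdict(Counter)
    for paper, index in shell.items():
        counts[category[paper]][index] += 1
    return {
        cat: {index: n / sum(per_shell.values()) for index, n in per_shell.items()}
        for cat, per_shell in counts.items()
    }
```

Computing these shares for the 2000, 2004, 2007 and 2010 snapshots and comparing them year by year would show whether a category drifts towards the core or towards the periphery.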

8. MODELING CITATION PROFILES USING DYNAMIC GROWTH MODEL

There has been extensive research done in developing growth 
models to explain the evolution of citation networks PH (23- 
Models like Barabási-Albert [DEI, Price [25] etc. attempt
to generate scale-free networks using the preferential attachment
mechanism. Most of these works seek to explain the emergence
of mainly the degree distribution of the network. In
this paper, we propose a novel dynamic growth model to synthesize
the citation network with the aim of reproducing the
citation categories obtained from the real-world dataset. To
the best of our knowledge, this model is the first of its kind 
which takes into account two major components, namely 
preferential attachment [l! and aging PH HU] in order to 
mimic the real-world citation profiles. 

We use the following distributions as inputs to the model
for a fair comparison with the real-world citation profiles:
the distribution of the number of papers over the years (to determine
the influx of papers into the system at each time step) and
the reference distribution (to determine the outward citations
that would emanate from an incoming node).
At each time step (corresponding to a particular year),
a number of nodes (papers) are selected, with the out-degree
(references) for each of them determined preferentially from the
reference distribution. Then the vertex is assigned
preferentially to a certain category based on the size of the
categories (number of papers in the categories) at that
time step. For determining the other end point of each
edge associated with the incoming node, we first select a
category preferentially based on the in-citation information
of each category, and then within the
category we select a node preferentially based on its attractiveness.
The attractiveness of a node (paper) is determined
by the time elapsed since its publication (aging) and the
number of citations accumulated till that time (preferential
attachment). Note that the formulation of the attractiveness
in our model also varies for different categories (see SI
text).

• Preferential attachment means that the more connected a node
is, the more likely it is to receive new connections. Preferential
attachment is an example of a positive feedback cycle where
initially random variations (one node initially having more
links or having started accumulating links earlier than another)
are automatically reinforced, thus greatly magnifying differences.

• In many growing networks, the age of the nodes plays an
important role in deciding the attachment probability of the
incoming nodes. For example, in a citation network, very old
papers are seldom cited while recent papers are usually cited
at a higher frequency.
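
The two-stage choice of a cited paper (first a category, then a paper inside it) can be sketched as weighted random sampling. This is a minimal sketch under stated assumptions: the helper names `pick_preferentially` and `pick_cited_paper` are hypothetical, "preferentially" is read as "with probability proportional to the weight", and the attractiveness values are assumed to be computed elsewhere.

```python
import random


def pick_preferentially(weights, rng):
    """Draw one key with probability proportional to its non-negative weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[key] for key in keys], k=1)[0]


def pick_cited_paper(category_citations, attractiveness, rng):
    """Two-stage choice: first a category, then a paper inside that category.

    category_citations: {category: total incoming citations of its papers}
    attractiveness:     {category: {paper: attractiveness value}}
    """
    category = pick_preferentially(category_citations, rng)
    paper = pick_preferentially(attractiveness[category], rng)
    return category, paper
```

Note that `random.choices` raises `ValueError` when all weights are zero, so a real simulation would need a fallback for categories that have not received any citation yet.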

We observe a remarkable resemblance between the real-world
citation profiles and those obtained from the model, as
shown in Figure 3 (bottom panel: (a1)-(e1)). Each frame of
this figure contains three lines depicting the first quartile (10%
points lie below this line), the third quartile (10% points lie
above this line) and the mean behavior. We also compare the
in-degree distributions obtained from the model and from
the real dataset for different categories and observe a significant
resemblance (see SI text). Hence our model presents a
holistic view of the evolution of a citation network over time
and the intra- and inter-category interactions that account
for the observable properties of the real-world system.

9. DISCUSSION 

The collection of a massive computer science bibliographic
dataset allows us to conduct an exhaustive analysis of citation
profiles of individual papers and to derive six predominant
categories that remained unobserved in the literature
so far. At the microlevel, this paper provides, for example,
a set of new approaches to characterize each individual category
as well as to study the dynamics of their evolution over
time. Finally, leveraging these behavioral signatures, we
are able to design a novel dynamic model to synthesize the
real-world network evolving over time. This model in turn
intrinsically unfolds the citation patterns of different categories,
which show a significant resemblance to those obtained
from the real data.

This paper thus offers a necessary first step towards reformulating
the existing quantifiers available in Scientometrics
that should leverage the signature of different citation patterns
in order to formulate robust measures. Moreover, we
believe that a systematic machine learning model of the behavior
of different citation patterns has the potential to enhance
the standard research methodology in this area, which
includes topics like discovering missing links in citation networks
m, early prediction of citations of scientific articles
[31], predicting high-impact and seminal papers [22], recommending
scientific articles [5] etc.

In the future, we plan to extend our study to datasets
of other domains such as physics and biology to verify the
universality of such categorizations. Moreover, we are keen
to understand the micro-level dynamics controlling the behavior
of the PeakMul category, which is significantly different
from the others. One initial observation in this direction
is that PeakMul behaves like the intermediary between the
PeakInit and PeakLate categories (see SI text). In the future,
we would like to conduct a detailed analysis to understand
different characteristic features, particularly for the PeakMul
category.


10. SUPPORTING INFORMATION 
10.1 Dynamic Growth Model 

Unlike the standard growth models proposed previously
for citation networks, our model takes into account both
“preferential attachment” mug and “aging” [20] of each paper
in order to synthesize the network. To begin with, we
include the network information of the first six years
(1970-1975, inclusive) from the real dataset to bootstrap the model.
This information includes the induced subgraph of the papers
published within these years, their category information and
the year of publication. Apart from this, for fair comparison
with the real-world dataset, we use two more distributions
generated from the real-world dataset - the year-wise distribution
of the number of publications (denoted as Pub-dist)
and the reference distribution (i.e., number of references vs.
fraction of papers having that many references, denoted as
Ref-dist). These form the base statistics of our model.

Since we know the category information of the papers
which are considered in the base statistics, we form six buckets
corresponding to the six categories. Each bucket contains
the papers of the corresponding category. Now for inserting
new papers in the network at each time step t (corresponding
to a particular year), we execute the following steps
in order:

1. Select the actual number of papers (say, N) published at
time t from Pub-dist, and create a set of N nodes
(say, P_N) to be inserted in the network.

2. For each such p present in P_N:

2.1 Select one of the buckets (categories) for p preferentially
based on the size of the buckets.

2.2 Select a value R from Ref-dist to determine the number
of outward edges (references) emanating from p.

2.3 For each outward edge, select the other end-point
using the following steps:

2.3.1 Select a bucket B preferentially based on the
number of incoming citations obtained by the
papers in different buckets till time t - 1.

2.3.2 Select one of the papers from B based on its
attractiveness at time t - 1.

2.4 Finally, assign p to the selected category.
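
The steps above can be sketched as a single simulation step. This is a hedged sketch rather than the authors' code: `grow_one_step` and `weighted_pick` are hypothetical names, the attractiveness function is passed in as a callable, the bootstrap network is assumed to be non-empty, a uniform fallback is assumed when all weights are zero, and repeated citations to the same paper are not filtered out.

```python
import random


def weighted_pick(weights, rng):
    """Pick a key with probability proportional to its weight."""
    keys = list(weights)
    if sum(weights[key] for key in keys) <= 0:
        return rng.choice(keys)  # assumption: uniform fallback when nothing is preferred
    return rng.choices(keys, weights=[weights[key] for key in keys], k=1)[0]


def grow_one_step(buckets, in_degree, n_new, ref_dist, attractiveness, rng, t):
    """Insert `n_new` papers at time step `t` and return the new citation edges.

    buckets:        {category: list of paper ids}, updated in place
    in_degree:      {paper id: citations received so far}, updated in place
    ref_dist:       {number of references: fraction of papers}
    attractiveness: callable (paper, category, time) -> non-negative weight
    """
    # Freeze the state at t - 1 so that papers added now do not cite each other.
    old = {cat: list(papers) for cat, papers in buckets.items()}
    bucket_size = {cat: len(papers) for cat, papers in old.items()}
    pi = {cat: {p: attractiveness(p, cat, t - 1) for p in papers}
          for cat, papers in old.items() if papers}
    bucket_cites = {cat: sum(in_degree[p] for p in old[cat]) for cat in pi}

    new_edges = []
    for i in range(n_new):                                  # step 2
        paper = f"t{t}_{i}"
        category = weighted_pick(bucket_size, rng)          # step 2.1
        n_refs = weighted_pick(ref_dist, rng)               # step 2.2
        for _ in range(n_refs):                             # step 2.3
            target_cat = weighted_pick(bucket_cites, rng)   # step 2.3.1
            target = weighted_pick(pi[target_cat], rng)     # step 2.3.2
            new_edges.append((paper, target))
            in_degree[target] += 1
        buckets[category].append(paper)                     # step 2.4
        in_degree[paper] = 0
    return new_edges
```

Calling this function once per year, with `n_new` read from Pub-dist, would grow the bootstrap network one time step at a time.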

The attractiveness ($\pi_i$) of a paper $p_i$ is determined by the
category where it falls using the two factors - the time after
its publication (aging) and the total citation count it
has accumulated so far (preferential attachment) as follows.
The notations used to describe the formulation of $\pi_i$ for
the different categories are - $k_i$: the in-degree of paper $p_i$,
$\mu$: average citation count of papers present in the category
where $p_i$ belongs to, $\rho$: a parameter of the model used to
dampen certain factors associated with it, $\tau$: a parameter of
the model used to scale the time of occurrences of the peaks.

• MonDec: Each paper declines in its popularity exponentially
with time but remains proportional to its
overall significance, i.e.,


• PeakInit: We assume T as the occurrence of the early
peak in such distribution. The PeakInit pattern comprises
two phases as commonly hypothesized for most of the
earlier papers - initial increase through preferential attachment
and then a continuous decline. Therefore,

$\pi_i = k_i + \mu$; if $t < T + \tau$
7 n = ki ^ ; otherwise

Figure 9: Figure S1. Comparison of the in-degree distributions obtained from the real-world dataset and the
model separately for different categories (both the axes are in log-scale).

• PeakLate: We assume T as the occurrence of the late
peak in such distribution. PeakLate behaves similarly
to PeakInit with the only difference that it experiences
its peak at a much later period after the time of publication.
Therefore,

$\pi_i = k_i + \mu$; if $t < T + \tau$
_ fci+e . otheriuise 

P t > 

• PeakMul: We empirically observe that about 65.04% of the
papers in the PeakMul category have two peaks on an average
in the timeline of the citation profile. Therefore,
in this model we assume that $T_1$ and $T_2$ are the times of the
two peaks respectively. The pattern of this profile can
be sub-divided into four phases - an initial rise through
preferential attachment followed by a monotonic decline,
and a second rise through preferential attachment
followed by a monotonic decline. Therefore,

$\pi_i = k_i + \mu$; if $0 < t < T_1 + \tau$
m if $T_1 + \tau < t < \frac{T_1 + T_2}{2} + \tau$
$\pi_i = k_i + \mu$; if $\frac{T_1 + T_2}{2} + \tau < t < T_2 + \tau$
7 n = ; otherwise

• MonIncr: In this category, each paper acquires popularity
directly proportional to its earlier citation count
as well as time. Thus,

$\pi_i = k_i + \rho\mu + t$; $\forall t$

• Oth: Since the papers in this category do not follow
any particular pattern of citation profile, we do not add
the aging factor here. When a newly added paper tries
to connect its references to the papers in this category,
the papers are selected uniformly at random, i.e., the
$\pi$ values for all the papers in this category are the same
and remain constant over time.

The values of $\pi$ per paper and the value of $\mu$ for each category
are calculated on the fly at each time t based on the
information at time t - 1. The times of occurrence of the peaks
(T, $T_1$ and $T_2$) are selected from the actual distribution of
peak occurrence time for each category individually. The
results in Figure 3 (bottom panel) of the main text are
averaged over 100 simulations. The final values of the
model parameters determined in order to closely match the
simulation results with the real-world patterns are as follows:
$\tau$ = 1 (PeakInit) and = 3 (rest); $\rho$ = 0.25 (MonDec),
= 0.7 (PeakInit), = 0.5 (PeakLate) and = 0.3 (MonIncr).
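
The category-specific attractiveness can be organized as a single function. The sketch below is deliberately partial and rests on several assumptions: the rise phases use k + mu as read from the piecewise definitions above, the MonIncr form k + rho * mu + t and the PeakMul midpoint boundary (T1 + T2) / 2 + tau are our reading of the source, t is taken as the time since publication, and the decline expression is not fixed here at all but supplied by the caller through the hypothetical `decline` argument.

```python
import math


def in_rise_phase(category, t, peaks, tau):
    """True while a peaked category is in a preferential-attachment (rising) phase."""
    if category in ("PeakInit", "PeakLate"):
        return t < peaks[0] + tau
    if category == "PeakMul":
        t1, t2 = peaks
        midpoint = (t1 + t2) / 2 + tau  # assumed end of the first decline phase
        return 0 < t < t1 + tau or midpoint < t < t2 + tau
    raise ValueError(f"category {category!r} has no peak")


def attractiveness(category, k, mu, t, peaks=(), tau=3, rho=1.0, decline=None):
    """Attractiveness of a paper with in-degree k in a category with mean citations mu."""
    if category == "Oth":
        return 1.0                        # constant, so papers are picked uniformly
    if category == "MonIncr":
        return k + rho * mu + t           # assumed reading of the MonIncr formula
    if category != "MonDec" and in_rise_phase(category, t, peaks, tau):
        return k + mu
    if decline is None:
        raise ValueError("a decline function must be supplied for this phase")
    return decline(k, mu, t, rho)         # caller-supplied decay, not the source's formula


def example_decline(k, mu, t, rho):
    """Purely illustrative exponential decay; NOT taken from the source."""
    return (k + mu) * math.exp(-rho * t)
```

Keeping the decline as a plug-in makes it easy to try alternative aging functions without touching the phase logic.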
Figure 10: Figure S2. Hypothetical lines showing
number of peaks, average height and average time
of occurrences of the peaks for PeakInit, PeakMul
and PeakLate categories.


10.2 Comparing the Emergent Degree Distributions for Different Categories

In the earlier section, we have proposed a model that
replicates different citation patterns observed in the real-world
dataset. In Figure 3 (bottom panel) of the main text, we
have observed that the results obtained from our dynamic
model have a significant resemblance with the citation profiles
observed in the real data. We further evaluate our
model in terms of degree distribution. Since we use the out-degree
distribution (reference distribution) in our model as
an input, we evaluate the model with respect to the in-degree
distribution. Figure S1 shows the in-degree distributions obtained
from the model and the real-world dataset separately
for different categories. We observe that the outputs of the
model have a significant resemblance with the real-world results.
This in turn further strengthens the applicability of
our model to reproduce the real-world phenomenon.
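
For readers who want to reproduce such a comparison, the per-category in-degree distribution can be computed as below. This is a minimal sketch under our own assumptions: edges are (citing, cited) pairs, `papers` is the list of papers in one category, and the function name is hypothetical.

```python
from collections import Counter


def in_degree_distribution(edges, papers):
    """Return {in-degree: fraction of `papers` having that in-degree}."""
    papers = set(papers)
    received = Counter(dst for _, dst in edges if dst in papers)
    # Counter returns 0 for papers that were never cited.
    histogram = Counter(received[paper] for paper in papers)
    return {degree: count / len(papers) for degree, count in sorted(histogram.items())}
```

Running this once on the real network and once on the simulated one, category by category, yields the two curves to be compared; on log-log axes the zero in-degree bin has to be left out.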

10.3 Intermediary Behavior of PeakMul Category

PeakMul seems to have a significantly unique citation profile
in comparison to all the other categories. For instance,
contrary to the other categories where the profile is mostly
determined by the height and the time of peak occurrence,
the characteristic of this category is controlled by one more
parameter - the number of peaks. Therefore for the PeakMul
category shown in Figure 3 of the main text, the line
depicting the average behavior (red line) does not have much
resemblance with the corresponding representative instance
(broken black line). In Figure S2, we show the average
height and the average time of occurrences of the peaks for
PeakInit, PeakMul and PeakLate categories. A deeper analysis
unfolds three interesting observations - (i) most of the
papers (65.04%) in the PeakMul category have two peaks on an
average in the timeline of the citation profile, (ii) the sum of
the average heights of the first two peaks in the PeakMul category
(i.e., 3.1 + 2.5 = 5.6) is (nearly) similar to the height of the
peak for the PeakInit and PeakLate categories (5.1 and 5.3 respectively),
(iii) the average difference between the times of
occurrence of the first two peaks in the PeakMul category (i.e.,
12.1 - 5.3 = 6.8) is (nearly) similar to the difference of the
occurrence of the peak in the PeakLate and PeakInit categories
(i.e., 10.8 - 4.2 = 6.6). From these observations, one could
argue that PeakMul behaves like the intermediary between
the PeakInit and PeakLate categories. We have used these observations
in order to configure the dynamic growth model
described in Section 10.1. The detailed analysis of this category
remains as one of the potential areas of future research.
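
The arithmetic behind observations (ii) and (iii) can be checked with a few lines. The numbers are the ones quoted above; the variable names and the use of a relative gap as the measure of "nearly similar" are our own illustrative choices.

```python
# Average peak heights and peak times quoted in the text.
peak_mul_heights = (3.1, 2.5)
peak_mul_times = (5.3, 12.1)
single_peak_height = {"PeakInit": 5.1, "PeakLate": 5.3}
single_peak_time = {"PeakInit": 4.2, "PeakLate": 10.8}

# (ii) Sum of the two PeakMul heights versus the single-peak heights.
height_sum = round(sum(peak_mul_heights), 1)
height_gap = {cat: abs(height_sum - h) / h for cat, h in single_peak_height.items()}

# (iii) Gap between the two PeakMul peaks versus the PeakLate - PeakInit gap.
mul_time_gap = round(peak_mul_times[1] - peak_mul_times[0], 1)
single_time_gap = round(single_peak_time["PeakLate"] - single_peak_time["PeakInit"], 1)
time_gap_ratio = abs(mul_time_gap - single_time_gap) / single_time_gap

print(height_sum, height_gap)
print(mul_time_gap, single_time_gap, time_gap_ratio)
```

With these figures, the summed height differs from the single-peak heights by less than 10%, and the two time gaps differ by about 3%.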